1 Analyzing Climate Change through Temperature Anomalies

By using R Studio and ggplot to analyze temperature variances between 1951-1980, the effects of climate change are illustrated.

I will use date from Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", #retrieving data 
           skip = 1, #skipping first line 
           na = "***") 

I have modified the data by adding skip and na, and we select the year and the twelve month variables from the dataset while deleting the rest as they are non-relevant.

1.1 Data Frame

tidyweather <- weather %>%  select(1:13) %>% pivot_longer(cols=2:13, names_to = "Month", values_to="delta")

tidyweather
## # A tibble: 1,704 × 3
##     Year Month delta
##    <dbl> <chr> <dbl>
##  1  1880 Jan   -0.35
##  2  1880 Feb   -0.51
##  3  1880 Mar   -0.23
##  4  1880 Apr   -0.3 
##  5  1880 May   -0.06
##  6  1880 Jun   -0.16
##  7  1880 Jul   -0.18
##  8  1880 Aug   -0.26
##  9  1880 Sep   -0.23
## 10  1880 Oct   -0.32
## # … with 1,694 more rows

1.2 Plot of Date and Delta Values

To illustrate the trend of temperature anomalies, I will plot the data using a time-series scatter plot, and have add a trendline.

# convert the date time datatype to plot chronologically
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")), 
         month = month(date, label=TRUE),
         year = year(date))

# plotting the scatter plot of the data
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies",
    x = "Date",
    y = "Temperature deviation"
  )+
  NULL

As evidenced in the plot, the change in temperature from out base year were negative before the 1970s and then temperatures started to increase steeply afterwards.

Visualizing the data on a per-month basis will help see if the effect of increasing temperature is more pronounced in some months than others.

#plotting the by-month scatter plot of data
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  geom_hline(yintercept = 0, color="orange")+
  theme_bw() +
  labs (
    title = "Weather Anomalies",
    x = "Date",
    y = "Temperature deviation"
  )+
  facet_wrap(~month)+       #Faceting on a per month basis 
  NULL

From the above chart one can see the temperature deviation is smaller from May to August compared to September to December. This means that the Spring and Summer seasons have been more prone to greater temperature anomalies than the Fall and Winter seasons.

To study the historical data, I will group data into different time periods.

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(           #assigning difference periods 
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

By clicking on it in the Environment pane, we inspected the dataframe and it has 7 columns: Year, Month, delta, date, month, year, interval which we just created.

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.

# Set `fill` to `interval` to group and colour the data by different time periods.
ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with tranparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"         #changing y-axis label to sentence case
  )

- From this we can see that as time goes by, the average delta of climate change increases from negative to positive, indicating the temperature is increasing. We can also see the temperature is increasing at a larger rate since 1951.

We are also interested in average annual anomalies, therefore we further modified the data to produce a scatter plot as below:

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(annual_average_delta = mean(delta, na.rm=TRUE)) 

#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  
  #change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta"
  )                         

  • As we can see from the plot, it corresponded to our earlier conclusion that the temperature is increasing at a larger rate starting significantly since 1960s.

1.3 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

We want to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

formula_ci <- comparison %>% filter(interval =="2011-present") %>% 
  summarise(annual_average_delta = mean(delta, na.rm=TRUE),
            sd_delta = sd(delta, na.rm=TRUE),
            count = n(),
            se_delta = sd_delta/sqrt(count),
            t_critical = qt(0.975, count-1),
            margin_of_error = t_critical * se_delta,
            delta_low = annual_average_delta - margin_of_error,
            delta_high = annual_average_delta + margin_of_error)

  # choose the interval 2011-present
  # what dplyr verb will you use? 

  # calculate summary statistics for temperature deviation (delta) 
  # calculate mean, SD, count, SE, lower/upper 95% CI
  # what dplyr verb will you use? 

  # summarise(ann_aver_delta = mean(delta,na.rm=T),
  #           sd_delta = sd(delta,na.rm=T),
  #           count=n(),
  #           t_critical = qt(0.975,count-1),
  #           margin_of_error = t_critical*sd_delta/sqrt(count),
  #           delta_low = ann_aver_delta-margin_of_error,
  #           delta_high = ann_aver_delta+margin_of_error)
#print out formula_CI
formula_ci
## # A tibble: 1 × 8
##   annual_average_d… sd_delta count se_delta t_critical margin_of_error delta_low
##               <dbl>    <dbl> <int>    <dbl>      <dbl>           <dbl>     <dbl>
## 1              1.06    0.274   132   0.0239       1.98          0.0472      1.01
## # … with 1 more variable: delta_high <dbl>
# use the infer package to construct a 95% CI for delta
set.seed(1234)

boot_delta<- comparison %>%
  filter(interval =="2011-present") %>%
  specify(response = delta) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>% 
  get_confidence_interval(level = 0.95, type = "percentile")
  
boot_delta
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.01     1.11

First we filtered out the years to include only 2011-present. Then we calculated summary statistics including the mean,sd,se and from this we calculated the confidence interval.

  • Using both the bootstrap method and the formula method we got a 95% confidence interval for delta as (1.01,1.11).
  • This means that we are 95% confident that the true mean for 2011-present for delta lies within the range of (1.01,1.11). We can confirm that there is a net increase in temperature since the base year.

2 General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviours, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc in American Society.

We analyzed data from the 2016 GSS sample data, using it to estimate values of population parameters of interest about US adults. The GSS sample data file has 2867 observations of 935 variables, but we are only interested in very few of these variables and therefore we filtered the data into a smaller file.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

We noticed that many responses should not be taken into consideration, like “No Answer”, “Don’t Know”, “Not applicable”, “Refused to Answer”.

We will be creating 95% confidence intervals for population parameters. The variables we have are the following: hours and minutes spent on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.50 hours, this would be recorded as emailhr = 2 and emailmin = 30. snapchat, instagrm, twitter: whether respondents used these social media in 2016 sex: Female - Male degree: highest education level attained

2.1 Instagram and Snapchat, by sex

We would like to estimate the population proportion of Snapchat or Instagram users in 2016, by following the below steps:

  1. Creating a new variable, snap_insta that is Yes if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.

  2. Calculating the proportion of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.

  3. Using the CI formula for proportions, please construct 95% CIs for men and women who used either Snapchat or Instagram

#Step 1
gss<-gss %>% mutate(snap_insta = case_when(
  (snapchat =="Yes" | instagrm == "Yes") ~ "Yes",
  (snapchat =="No" & instagrm == "No") ~ "No",
  T ~ "NA"))

#Step 2
gss %>% 
  filter(!snap_insta == "NA") %>% 
  count(snap_insta, sort=TRUE) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 × 3
##   snap_insta     n  prop
##   <chr>      <int> <dbl>
## 1 No           858 0.625
## 2 Yes          514 0.375
# se for proportion = sqrt (P*(1-P)/n)

#Step 3
se_prop = sqrt(
  0.625 * 0.375 / (514+858)
)

conf_low <-  0.375 - 1.96 * se_prop
conf_high <- 0.375 + 1.96 * se_prop

conf_high
## [1] 0.401
conf_low
## [1] 0.349
  • We are 95% confident that the true proportion of instagram or snap users falls between 0.349 and 0.401.

2.2 Twitter, by education level

Next, we estimated the population proportion of Twitter users by education level in 2016 in the following stages

1.We calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use twitter. 2. Using the CI formula for proportions, we constructed two 95% CIs for bachelor_graduate vs whether they use (Yes) and don’t (No) use twitter. 3. We assessed whether or not these confidence intervals overlapped?

#Turn `degree` from a character variable into a factor variable
#Create a  new variable, `bachelor_graduate` to indicate if the respondent is bachelor or graduate
gss <- gss %>%  
  mutate(degree = factor(degree, levels=c("Lt high school","High school", "Junior college", "Bachelor", "Graduate")),
                       bachelor_graduate = case_when(
  (degree =="Bachelor" | degree == "Graduate") ~ "Yes",
  (degree != "Bachelor" & degree != "Graduate") ~ "No",
  T ~ "NA"))

#Calculate the proportion of `bachelor_graduate` who do (Yes) and who don't (No) use twitter
gss %>% filter(bachelor_graduate == "Yes") %>%
  filter(twitter != "NA") %>%
  count(twitter, sort=TRUE) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 × 3
##   twitter     n  prop
##   <chr>   <int> <dbl>
## 1 No        375 0.767
## 2 Yes       114 0.233
#Using the CI formula for proportions
se_prop_usetwitter = sqrt(
  0.767 * 0.233 / (114+375)
)

se_prop_notwitter = sqrt(
  0.767 * 0.233 / (375+144)
)


conf_low_usetwitter <-  0.233 - qt(0.975, 113) * se_prop_usetwitter
conf_high_usetwitter <- 0.233 + qt(0.975, 113) * se_prop_usetwitter

conf_low_notwitter <-  0.767 - qt(0.975, 374) * se_prop_notwitter
conf_high_notwitter <- 0.767 + qt(0.975, 374) * se_prop_notwitter


conf_low_notwitter
## [1] 0.731
conf_high_notwitter
## [1] 0.803
conf_low_usetwitter
## [1] 0.195
conf_high_usetwitter
## [1] 0.271
  • The 95% confidence interval for the proportion of bachelors and graduates that do not use twitter is 0.731 to 0.803.

  • The 95% confidence interval for the proportion of bachelors and graduates that do use twitter is 0.195 to 0.271.

2.3 Email usage

Next we estimated the population parameter on time spent on email weekly.

  • The mean of the number of minutes spent on email weekly is 417 minutes and the median is 120 minutes. The median is a better measure of the typical amount of time Americans spent on email weekly as the mean is substantially increased due to the outliers of extremely high email usage. Multiple individuals claimed to spend more than 4000 minutes on email weekly which skews the mean higher. The median is a more accurate representation of typical usage as it isn’t affected by the outliers.

  • The 95% CI for the mean weekly email time is 6 hours and 25 minutes to 7 hours and 33 minutes. There does not seem to be an odd result. The mean of the dataset is 6hrs and 57 minutes which falls into this CI.

  • One would expect a 99% confidence interval to be wider than the 95% CI. This is because to be more certain about the values of the mean weekly email time it is necessary to increase the range and hence have a wider interval. Also mathematically the t-critical value is larger and hence the CI is wider.

set.seed(1234)
#Converts email hr and email min into one variable called email
gss <- gss %>%
  mutate(email = as.numeric(emailhr) * 60 + as.numeric(emailmin))
#Creates histogram showing the distribution of email
gss %>%
  ggplot(aes(x=email)) +
  geom_histogram() + labs(title = "Histogram of Time spent on email",
                          x= "Minutes spent on email")

gss %>% summarise(meanEmail = mean(email, na.rm=TRUE),
                  medianEmail = median(email, na.rm=TRUE))
## # A tibble: 1 × 2
##   meanEmail medianEmail
##       <dbl>       <dbl>
## 1      417.         120
#Bootstrap creating a 95% confidence interval for the mean weekly email usage.
boot_email <- gss %>%
  specify(response = email) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_confidence_interval(level = 0.95, type = "percentile")

boot_email
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     385.     453.
hours = boot_email%/%60 #Calculates the hours using integer division (61 minutes will give 1 hour)
minutes = round(boot_email%%60) # Calculates hours using modulus division (getting remainder) (61 min will give 1 minute)

paste(hours, "hours", minutes, "minutes") # function to paste the values for the confidence interval: "6 hours 25 minutes" "7 hours 33 minutes"
## [1] "6 hours 25 minutes" "7 hours 33 minutes"
# wider 
#Bootstrap creating a 99% confidence interval for the mean weekly email usage.
# boot_email2 <- gss %>%
#   specify(response = email) %>%
#   generate(reps = 1000, type = "bootstrap") %>%
#   calculate(stat = "mean") %>%
#   get_confidence_interval(level = 0.99, type = "percentile")
# boot_email2

3 Biden’s Approval Margins

We will now start our analysis of Biden’s approval ratings. Fivethirtyeight.com has detailed data on all polls that track the president’s approval

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 2,098
## Columns: 22
## $ president           <chr> "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "11/8/2021", "11/8/2021", "11/8/2021", "11/8/2021"…
## $ startdate           <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate             <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/22/2021"…
## $ pollster            <chr> "Morning Consult", "Rasmussen Reports/Pulse Opinio…
## $ grade               <chr> "B", "B", "B", "B", "B-", "B+", "B+", "B", "B", "B…
## $ samplesize          <dbl> 15000, 1500, 1993, 15000, 1115, 1516, 941, 15000, …
## $ population          <chr> "a", "lv", "rv", "a", "a", "a", "rv", "a", "lv", "…
## $ weight              <dbl> 0.2594, 0.3382, 0.0930, 0.2333, 1.1014, 1.2454, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 50, 48, 56, 51, 55, 45, 63, 52, 48, 58, 55, 48, 53…
## $ disapprove          <dbl> 28, 45, 31, 28, 32, 28, 37, 29, 47, 32, 33, 47, 29…
## $ adjusted_approve    <dbl> 48.5, 50.4, 54.5, 49.5, 53.8, 46.4, 59.1, 50.5, 50…
## $ adjusted_disapprove <dbl> 31.3, 38.9, 34.3, 31.3, 33.1, 28.6, 38.1, 32.3, 40…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, TRUE, NA, NA, NA, TRUE, TRUE, NA, …
## $ url                 <chr> "https://morningconsult.com/form/global-leader-app…
## $ poll_id             <dbl> 74272, 74247, 74246, 74273, 74248, 74327, 74256, 7…
## $ question_id         <dbl> 139491, 139395, 139394, 139492, 139404, 139570, 13…
## $ createddate         <chr> "1/28/2021", "1/22/2021", "1/22/2021", "1/28/2021"…
## $ timestamp           <chr> "12:50:12  8 Nov 2021", "12:50:12  8 Nov 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.

approval_polllist <- approval_polllist %>%
  mutate(modeldate = mdy(modeldate),
         startdate = mdy(startdate),
         enddate = mdy(enddate), 
         createddate = mdy(createddate))

3.1 Creating a Plot

Now we will calculate the average net approval rate (approve- disapprove) for each week since Biden got into office. We will plot Biden’s net approval, along with its 95% confidence interval. For the date, we will use enddate, i.e., the date the poll ended.

3.2 Comparing Confidence Intervals

  • In the bottom graph, the confidence intervals for week 3 and week 25 are very different. This could be due to a difference in sample size, or variation in responses. We believe it is more likely that the difference is due to week 3 having a much smaller sample size.

  • In the top graph we compared week 4 and week 25. There is not much of a difference in confidence intervals, suggesting the sample size is probably similar for both week 4 and week 25.

4 Gapminder revisited

Now we look at multiple data frames listed below-

  1. Life expectancy at birth (life_expectancy_years.csv)
  2. GDP per capita in constant 2010 US$ (https://data.worldbank.org/indicator/NY.GDP.PCAP.KD) 3.Female fertility: The number of babies per woman (https://data.worldbank.org/indicator/SP.DYN.TFRT.IN)
  3. Primary school enrollment as % of children attending primary school (https://data.worldbank.org/indicator/SE.PRM.NENR)
  4. Mortality rate, for under 5, per 1000 live births (https://data.worldbank.org/indicator/SH.DYN.MORT)
  5. HIV prevalence (adults_with_hiv_percent_age_15_49.csv): The estimated number of people living with HIV per 100 population of age group 15-49.
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))

# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")


library(wbstats)

worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  wbstats::wb_cachelist$countries

We had to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. We had to tidy your data first and then perform join operations. Thinking about what type made sense and explain why you chose it.

hiv_long <- hiv %>% 
  pivot_longer(cols = 2:34, #columns 2 to 34
               names_to = "Year",
               values_to = "HIV_Value")

skim(hiv_long)
Data summary
Name hiv_long
Number of rows 5082
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 24 0 154 0
Year 0 1 4 4 0 33 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
HIV_Value 1781 0.65 1.74 4.09 0.01 0.1 0.3 1.2 26.5 ▇▁▁▁▁
hiv_long
## # A tibble: 5,082 × 3
##    country     Year  HIV_Value
##    <chr>       <chr>     <dbl>
##  1 Afghanistan 1979         NA
##  2 Afghanistan 1980         NA
##  3 Afghanistan 1981         NA
##  4 Afghanistan 1982         NA
##  5 Afghanistan 1983         NA
##  6 Afghanistan 1984         NA
##  7 Afghanistan 1985         NA
##  8 Afghanistan 1986         NA
##  9 Afghanistan 1987         NA
## 10 Afghanistan 1988         NA
## # … with 5,072 more rows
life_expectancy_long <- life_expectancy %>% 
                          pivot_longer(cols = 2:302, #columns 2 to 302
                       names_to = "Year",
                    values_to = "Life_Expectancy_Value")

skim(life_expectancy_long)
Data summary
Name life_expectancy_long
Number of rows 56287
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 30 0 187 0
Year 0 1 4 4 0 301 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Life_Expectancy_Value 759 0.99 53 21.7 1.01 32.3 48.7 74.2 94.8 ▁▇▂▅▅
life_expectancy_long
## # A tibble: 56,287 × 3
##    country     Year  Life_Expectancy_Value
##    <chr>       <chr>                 <dbl>
##  1 Afghanistan 1800                   28.2
##  2 Afghanistan 1801                   28.2
##  3 Afghanistan 1802                   28.2
##  4 Afghanistan 1803                   28.2
##  5 Afghanistan 1804                   28.2
##  6 Afghanistan 1805                   28.2
##  7 Afghanistan 1806                   28.1
##  8 Afghanistan 1807                   28.1
##  9 Afghanistan 1808                   28.1
## 10 Afghanistan 1809                   28.1
## # … with 56,277 more rows
#Merged HIV & Life_Expectancy data by matching Year and Country
hiv_life_expectancy <- hiv_long %>% inner_join(  life_expectancy_long , by = c ( "country","Year" ))

#Renaming date column to Year on World Bank Data

worldbank_data <- rename(worldbank_data ,Year=date)


worldbank_data <- worldbank_data %>% 
  mutate(Year = as.character(Year))


#skim(hiv_life_expectancy)
#skim(worldbank_data)

#Merging Life Exp & HIV data with World bank data by Country & Year
hiv_life_expectancy_worldbank_data <- hiv_life_expectancy %>% inner_join( worldbank_data , by = c ( "country","Year" )  )
  1. We checked the relationship between HIV prevalence and life expectancy by generating a scatterplot with a smoothing line to report your results.
hiv_life_expectancy_worldbank_data <- hiv_life_expectancy_worldbank_data %>% 
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1980:1989) ~ "1979-1989",
    Year %in% c(1990:2000) ~ "1990-2000",
    Year %in% c(2001:2010) ~ "2001-2010",
    #Year %in% c(1999:2009) ~ "1999-2009",
    TRUE ~ "2011-present"
  ))



ggplot(hiv_life_expectancy_worldbank_data, aes(x = HIV_Value , y = Life_Expectancy_Value)) +
  geom_point() +
  geom_smooth(method = "lm") +
   labs(
    title = "Relationship of HIV vs Life Expectancy",
    x = "HIV Value",
    y = "Life Expectancy"
  )+
  facet_wrap(~interval)+
  NULL

  • We have faceted the graphs by intervals. We see that as the HIV Value increases the life expectancy decreases.
  1. Now we analysed the relationship between fertility rate and GDP per capita? We generate a scatterplot with a smoothing line to report our results, using facetting as a useful tool.
library(countrycode)

#Creating a dataset hiv_life_expectancy_worldbank_data_continent which has a column for continents corresponding to each country

hiv_life_expectancy_worldbank_data_continent <- hiv_life_expectancy_worldbank_data

hiv_life_expectancy_worldbank_data_continent <- cbind(hiv_life_expectancy_worldbank_data, new_col = "continent") 


hiv_life_expectancy_worldbank_data_continent$continent <- countrycode(sourcevar = hiv_life_expectancy_worldbank_data_continent$country,
                            origin = "country.name",
                            destination = "continent")


ggplot(hiv_life_expectancy_worldbank_data_continent, aes(x = NY.GDP.PCAP.KD , y = SP.DYN.TFRT.IN)) +
  geom_point() +
  geom_smooth(method = "lm") +
   labs(
    title = "Relationship",
    x = "GDP",
    y = "Fertility"
  )+
  facet_wrap(~continent)+

  NULL

  • As we can see from plotting the GDP per capita against the fertility rate, when GDP per capita increases the fertility rate decreases substantially and reaches an upper bound at a fertility rate of 3

  • In Asia, Africa, Americas and Ocenia we see the trend that the higher the GDP the lower the Fertility rate

  • However, in Europe we see that the higher GDP countries also have a higher fertility rate

  1. Now we check which regions have the most observations with missing HIV data and generate a bar chart, in descending order.
#aggregate(x = HIV_Value, data=hiv_life_expectancy_worldbank_data, count(is.na(x)))


hiv_missing <- hiv_life_expectancy_worldbank_data %>%
  mutate(na_yes_no = ifelse( is.na(HIV_Value) , "Yes" , "No" )  ) %>% 
  group_by(country,na_yes_no) %>%
  summarise(count_missing=n()) %>%
  filter( na_yes_no == "Yes" ) %>%
  arrange(-count_missing)


hiv_missing[1:15,] %>%  
#  slice_max ( order_by = count_missing, n=5 ) %>%
  ggplot(aes(x = count_missing, y = fct_reorder(country, count_missing))) +
  geom_col(fill="orange")+
  labs(x="No. of missing values" , y="Country" , title= "Missing values of HIV data per country")+
    NULL

  1. Now we see how the mortality rate for under 5 has changed by region.

In each region, find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.

#now we filter the data from year 2011 and 1979
mortality_1979_2011 <- hiv_life_expectancy_worldbank_data_continent %>% 
                      filter ( as.numeric(Year) %in% c(  1979,2011 ) )

#now we select the columns to consider
mortality_1979_2011 <- select( mortality_1979_2011 , c( "Year", "continent" , "country" , "SH.DYN.MORT" ))

#we rename the columns to add a character 
mortality_1979_2011$Year = paste("y",mortality_1979_2011$Year,sep="")

#now we make the table wider to allow us to select the difference in values later
mortality_1979_2011_wide <- pivot_wider( data=mortality_1979_2011 , names_from = Year, values_from = SH.DYN.MORT  )


#now we add an additional column to calcualte the difference in mortality values
mortality_1979_2011_wide <- mortality_1979_2011_wide %>% mutate(mortality_diff = y2011 - y1979 ) 

#we group by the region and rank the mortality differences
mortality_1979_2011_ranked <- mortality_1979_2011_wide %>% 
                          group_by(continent) %>% 
                          summarise( country=country,  asc_ranking = rank(mortality_diff) , dsc_ranking = rank(-mortality_diff)  )


#as per our analysis the lesser mortality range difference means the worst improvement and the largest mortality rate differnce in 1979 and 2011 shows the best improved country

mortality_best5 <- mortality_1979_2011_ranked %>% filter( asc_ranking <=5  )

mortality_worst5 <- mortality_1979_2011_ranked %>% filter( dsc_ranking <=5  )


mortality_worst5 <-select( mortality_worst5  , c ( "continent" , "country"))

mortality_worst5
## # A tibble: 24 × 2
## # Groups:   continent [5]
##    continent country                 
##    <chr>     <chr>                   
##  1 Africa    Botswana                
##  2 Africa    Central African Republic
##  3 Africa    Lesotho                 
##  4 Africa    Mauritius               
##  5 Africa    Zimbabwe                
##  6 Americas  Barbados                
##  7 Americas  Canada                  
##  8 Americas  Costa Rica              
##  9 Americas  Cuba                    
## 10 Americas  United States           
## # … with 14 more rows
mortality_best5 <-select( mortality_worst5  , c ( "continent" , "country"))

mortality_best5
## # A tibble: 24 × 2
## # Groups:   continent [5]
##    continent country                 
##    <chr>     <chr>                   
##  1 Africa    Botswana                
##  2 Africa    Central African Republic
##  3 Africa    Lesotho                 
##  4 Africa    Mauritius               
##  5 Africa    Zimbabwe                
##  6 Americas  Barbados                
##  7 Americas  Canada                  
##  8 Americas  Costa Rica              
##  9 Americas  Cuba                    
## 10 Americas  United States           
## # … with 14 more rows
  1. We now check the relationship between primary school enrollment and fertility rate
ggplot(hiv_life_expectancy_worldbank_data_continent, aes(x = SP.DYN.TFRT.IN , y = SE.PRM.NENR)) +
  geom_point() +
  geom_smooth(method = "lm") +
   labs(
    title = "Relationship",
    x = "Fertility",
    y = "School Enrollment"
  )+
  facet_wrap(~continent)+
  NULL

Once again we see that Europe and the rest of the continents have different trends. - In Asia, Afirca, Americas and Oceania - the higher the fertility, the lesser is the school enrollment. - In Europe, the higher the fertility rate the more is the school enrollment.

5 Excess rentals in TfL bike sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-09-23T12%3A52%3A20/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20211109%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20211109T084232Z&X-Amz-Expires=300&X-Amz-Signature=8db12e71e5ff5736af1529eb4e577e476340d6160f0a8d0c2d44e1d16117b972&X-Amz-SignedHeaders=host]
##   Date: 2021-11-09 08:43
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 174 kB
## <ON DISK>  /var/folders/mv/p3nq9fm564nghj5prw0y4g_r0000gn/T//RtmpmkeQzO/file4a57d371b8e.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = F),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.

  • In May and June of 2020 there is a huge decline in bike rentals due to the pandemic.

We will now reproduce the following graph. The graph looks at the monthly change in TfL from the monthly averages calculated in 2016-2019. The blue line is the mean bike rentals of the months over 2016-2019. The red shaded region shows the months where the monthly rentals fell below the average and the green shows the months where it was above the average.

# Calculates the average monthly bikes rented using data from 2016 to 2021.
expected_hires <- bike %>% 
  filter(day>="2016-01-01")%>%
  group_by(year, month) %>% 
  summarize(bikes_hired = mean(bikes_hired)) %>%  #takes the daily data and creates a monthly mean for each year/month combo
  group_by(month) %>%
  summarise(expected_hired = mean(bikes_hired)) #outputs mean bike rentals by month (Jan-Dec) with only 12 rows

#modifying the dataset and adding the averages previously calculated in expected_hires
plot_bike <- bike %>% 
  filter(day>="2016-01-01")%>%
  group_by(year, month) %>%
  summarize(bikes_hired = mean(bikes_hired)) %>%  #gets the average bikes for each year/month combo 1/2016, 2/2016 ....
  inner_join(expected_hires, by = "month") %>% #adds column with the average bike rentals to each year/month combo
  mutate(fill = bikes_hired>expected_hired, #creates a True/Flase column if bikes rentals are above or below the average
         up = ifelse(bikes_hired>expected_hired, bikes_hired-expected_hired, 0), #calculates if above the average and the number of rentals above, if it is not 0 is given.
         down = ifelse(bikes_hired>expected_hired, 0,bikes_hired-expected_hired), #calculates if below the average and the number of rentals below, if it is not 0 is given.
         Month = month(month, label=T)) #gets the month value in chr format



plot_bike$lower = apply(plot_bike[,3:4],1,min) # creates a column taking the smallest value from actual vs average bikes hired
plot_bike$higher = apply(plot_bike[,3:4],1,max) # creates a column taking the largest value from actual vs average bikes hired
plot_bike$date = ym(paste(plot_bike$year,plot_bike$month)) #creates column with date in ym format

#Recreating the plot
plot_bike %>%
  ggplot(aes(x=Month)) +
  geom_line(aes( y=expected_hired, group=year),colour="blue",size=2)+ #draws the average
  geom_line(aes(y=bikes_hired, group=year),colour="black",size=.5)+ #draws the actual bikes hired
  geom_ribbon(aes(ymin=expected_hired,ymax=expected_hired+up, group=year),fill="#7DCD85",alpha=0.4)  + #creates green shaded
  geom_ribbon(aes(ymin=expected_hired,ymax=expected_hired+down, group=year),fill="#CB454A",alpha=0.4)+ #creates red shaded
  facet_wrap(~year)+ #creates plots for years
  theme(axis.text.x = element_text(angle=60 , hjust=1) ) +
 # theme_bw() + 
  labs(title = "Monthly changes in TfL bike rentals", 
                    subtitle = "change from monthly average shown in blue and calculated between 2016-2019", caption= "Source: TfL, London Data Store",
       x="", y="Bike Rentals") +
  NULL

The second graph we recreated looks at percentage change from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

Here the green shaded region shows the percentage of rentals above the average and the red shows the percentage below.

#Calculating the weekly means
expected_hires_week <- bike %>% 
  filter(day>="2016-01-01" & day<"2020-01-01")%>%
  group_by(year, week) %>% 
  summarize(bikes_hired = mean(bikes_hired)) %>%  #takes the daily data and creates a weekly mean for each year/week combo
  group_by(week) %>%
  summarise(expected_hired = mean(bikes_hired)) #outputs mean bike rentals by week with 53 rows 


#modifying the dataset and adding the averages previously calculated in expected_hires
plot_bike_week <- bike %>% 
  filter(day>="2016-01-01")%>%
  filter(!(year==2021 & week==53)) %>% # gets rid of week 53 in 2021 causing weird line in plot.
  group_by(year, week) %>%
  summarize(bikes_hired = mean(bikes_hired)) %>%  
  inner_join(expected_hires_week, by = "week") %>% #joins the two datasets (one with mean) by week
  mutate(fillcolor = bikes_hired>expected_hired,
         excess_rentals = bikes_hired - expected_hired, #calculates rentals above average
         percentage_change_expected = (excess_rentals/expected_hired), #calcualtes percentage above avg
         up = ifelse(percentage_change_expected>0, (excess_rentals/expected_hired), 0), 
         down = ifelse(percentage_change_expected>0, 0,(excess_rentals/expected_hired)))

plot_bike_week$lower = apply(plot_bike_week[,3:4],1,min) # creates a column taking the smallest value from actual vs average bikes hired
plot_bike_week$higher = apply(plot_bike_week[,3:4],1,max) # creates a column taking the largest value from actual vs average bikes hired


plot_bike_week %>% ggplot(aes(x=week, y=percentage_change_expected)) +
  annotate(geom="rect", xmin = 14,xmax = 26, ymin=-Inf, ymax=Inf, alpha=0.1) + #Q2
  annotate(geom="rect", xmin = 40,xmax = 52, ymin=-Inf, ymax=Inf, alpha=0.1) + #Q4
  geom_line(aes(x=week, y=percentage_change_expected)) +  #Creates average line
  geom_ribbon(aes(ymin=0,ymax=up, group=year),fill="#7DCD85",alpha=0.4)  + #adds green shaded
  geom_ribbon(aes(ymin=0,ymax=down, group=year),fill="#CB454A",alpha=0.4)+ #adds red shaded
  geom_rug(aes(x=week), color=ifelse(plot_bike_week$fillcolor,"#7DCD85","#CB454A"), sides="b") +
  facet_wrap(~year) +
  scale_y_continuous(labels = scales::percent) + #adds percent on axis
  theme_bw() + 
  theme(legend.position = "none") +
  labs(title = "Weekly change in TfL bike rentals", 
                    subtitle = "% change from weekly averages calculated between 2016-2019", 
       y="")

Here we discuss whether to use the mean or the median to calculate your expected rentals and why.

  • In our graphs we calculate the expected number of rentals per week or month between 2016-2019 and then, see how each week/month of 2020-2021 compares to the expected rentals. Think of the calculation excess_rentals = actual_rentals - expected_rentals.
  • The bike rentals seem to be normally distributed and the mean is a good representation of the population mean.
  • The graphs are identical when the mean is used and since we are trying to replicate the graphs we have used the mean.

6 Details

Team Members: Alex Kubbinga, Clara Moreno Sanchez, Jean Huang, Raghav Mehta, Raina Doshi, Yuan Gao

  • Approximately how much time did you spend on this problem set: Too long (~10 hours)
  • What, if anything, gave you the most trouble: Adjusting tiny details of images is time-consuming